Some practice

Let’s practice some usefull commands

Writer
Affiliation
Javier Silva-Valencia

Instituut Voor Tropische Geneeskunde. Antwerp-Belgium

Published

2023-02-27

Abstract
Today we had an introduction to how to create subsets by row, subset by colls, appending (joining up-down), merging (joining left-righ).


Exercise 1

Let’s practice dividing and joining data.

Import CSV

Importing the CSV database under the name of Q1_B, with “,” as separator, and with “.” as decimal:

Q1_B <- read.csv("C:/Users/pined/OneDrive - Universidad Nacional Mayor de San Marcos/Javier 2022/Belgica/AC2_DataAnalysis_ThWk/Material/Q1_B.csv", sep=",", dec= ".")

Create a subset by row

Checking the identification variable of this dataset (FSN)

summary(Q1_B$FSN)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  45001   48527   52032   52110   55671   59606 
Tip

The median is 52032

Let’s divide the data in two parts taking as cut point the half of the rows (The median of FSN)

top <- subset(Q1_B, subset= FSN < 52032)
bottom <- subset(Q1_B, subset= FSN >= 52032)

Seeing the structure of the original dataset and the new subsets

str(Q1_B)     #Original dataset
'data.frame':   13377 obs. of  40 variables:
 $ ID                 : int  1 2 3 4 5 6 7 8 9 10 ...
 $ FSN                : num  45001 45002 45003 45004 45005 ...
 $ Neem_Tree          : num  0 0 0 0 0 0 0 0 0 0 ...
 $ Neem_Tree_Distance : num  -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 ...
 $ Neem_Tree_Size     : num  -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 ...
 $ Neem_Tree_Age      : num  -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 ...
 $ Neem_Tree_Usage    : chr  "" "" "" "" ...
 $ Neem_Tree_Use_Other: chr  "" "" "" "" ...
 $ Bamboo_Tree        : num  1 1 1 1 1 1 1 1 1 1 ...
 $ Bamboo_Tree_Dist   : num  3 10 16 17 5 4 15 3 1 10 ...
 $ Banana_Tree        : num  0 0 0 0 0 0 0 1 1 0 ...
 $ Banana_Tree_Dist   : num  -1 -1 -1 -1 -1 -1 -1 5 1 -1 ...
 $ Rice_Field         : num  1 1 1 1 1 1 1 1 1 1 ...
 $ Rice_Field_Dist    : num  4 12 16 13 4 13 10 10 7 12 ...
 $ Perm_Water_Body    : num  0 0 0 0 0 0 0 0 0 0 ...
 $ Perm_Wat_Body_Dist : num  -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 ...
 $ Wat_Body_Mid_Point : num  -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 ...
 $ No_Mosquito_Net    : num  0 2 1 0 0 0 0 0 1 0 ...
 $ Sprayed_2010       : num  86 86 86 86 86 86 86 86 86 86 ...
 $ Sprayed_2009       : num  86 86 86 86 86 86 86 86 86 86 ...
 $ Floor_Material     : num  153 153 153 153 153 153 153 153 155 153 ...
 $ Other_Floor_Mat    : num  -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 ...
 $ Is_Floor_Damp      : num  1 1 1 1 1 1 1 1 0 1 ...
 $ Roof_Material      : num  161 159 158 156 156 156 156 158 158 156 ...
 $ Other_Roof_Material: num  -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 ...
 $ Wall_Material      : num  164 164 164 162 163 163 162 164 164 162 ...
 $ Other_Wall_Material: num  -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 ...
 $ Windows_in_Room    : num  1 1 1 0 0 0 0 1 1 0 ...
 $ Granaries_in_HH    : num  1 1 1 0 1 1 0 1 1 0 ...
 $ Source_Drink_Water : num  167 167 92 92 92 92 92 92 92 92 ...
 $ Other_Src_Drink_Wat: num  -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 ...
 $ Toilet_Facility    : num  177 177 177 177 177 177 177 177 177 177 ...
 $ Other_Toilet_Fac   : num  -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 ...
 $ Cooking_Fuel       : num  180 180 180 180 180 180 180 180 180 180 ...
 $ Other_Cooking_Fuel : num  -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 ...
 $ Source_Light       : num  182 182 182 182 182 182 182 182 182 182 ...
 $ Other_Source_Light : num  -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 ...
 $ Electricity_in_HH  : num  0 0 0 0 0 0 0 0 0 0 ...
 $ No_Of_Rooms        : num  2 2 3 1 2 3 2 1 3 1 ...
 $ No_Sleeping_Rooms  : num  2 2 2 1 2 2 2 1 1 1 ...
str(top)      #Subset top of the data
'data.frame':   6688 obs. of  40 variables:
 $ ID                 : int  1 2 3 4 5 6 7 8 9 10 ...
 $ FSN                : num  45001 45002 45003 45004 45005 ...
 $ Neem_Tree          : num  0 0 0 0 0 0 0 0 0 0 ...
 $ Neem_Tree_Distance : num  -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 ...
 $ Neem_Tree_Size     : num  -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 ...
 $ Neem_Tree_Age      : num  -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 ...
 $ Neem_Tree_Usage    : chr  "" "" "" "" ...
 $ Neem_Tree_Use_Other: chr  "" "" "" "" ...
 $ Bamboo_Tree        : num  1 1 1 1 1 1 1 1 1 1 ...
 $ Bamboo_Tree_Dist   : num  3 10 16 17 5 4 15 3 1 10 ...
 $ Banana_Tree        : num  0 0 0 0 0 0 0 1 1 0 ...
 $ Banana_Tree_Dist   : num  -1 -1 -1 -1 -1 -1 -1 5 1 -1 ...
 $ Rice_Field         : num  1 1 1 1 1 1 1 1 1 1 ...
 $ Rice_Field_Dist    : num  4 12 16 13 4 13 10 10 7 12 ...
 $ Perm_Water_Body    : num  0 0 0 0 0 0 0 0 0 0 ...
 $ Perm_Wat_Body_Dist : num  -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 ...
 $ Wat_Body_Mid_Point : num  -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 ...
 $ No_Mosquito_Net    : num  0 2 1 0 0 0 0 0 1 0 ...
 $ Sprayed_2010       : num  86 86 86 86 86 86 86 86 86 86 ...
 $ Sprayed_2009       : num  86 86 86 86 86 86 86 86 86 86 ...
 $ Floor_Material     : num  153 153 153 153 153 153 153 153 155 153 ...
 $ Other_Floor_Mat    : num  -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 ...
 $ Is_Floor_Damp      : num  1 1 1 1 1 1 1 1 0 1 ...
 $ Roof_Material      : num  161 159 158 156 156 156 156 158 158 156 ...
 $ Other_Roof_Material: num  -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 ...
 $ Wall_Material      : num  164 164 164 162 163 163 162 164 164 162 ...
 $ Other_Wall_Material: num  -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 ...
 $ Windows_in_Room    : num  1 1 1 0 0 0 0 1 1 0 ...
 $ Granaries_in_HH    : num  1 1 1 0 1 1 0 1 1 0 ...
 $ Source_Drink_Water : num  167 167 92 92 92 92 92 92 92 92 ...
 $ Other_Src_Drink_Wat: num  -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 ...
 $ Toilet_Facility    : num  177 177 177 177 177 177 177 177 177 177 ...
 $ Other_Toilet_Fac   : num  -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 ...
 $ Cooking_Fuel       : num  180 180 180 180 180 180 180 180 180 180 ...
 $ Other_Cooking_Fuel : num  -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 ...
 $ Source_Light       : num  182 182 182 182 182 182 182 182 182 182 ...
 $ Other_Source_Light : num  -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 ...
 $ Electricity_in_HH  : num  0 0 0 0 0 0 0 0 0 0 ...
 $ No_Of_Rooms        : num  2 2 3 1 2 3 2 1 3 1 ...
 $ No_Sleeping_Rooms  : num  2 2 2 1 2 2 2 1 1 1 ...
str(bottom)   #Subset bottom of the data
'data.frame':   6689 obs. of  40 variables:
 $ ID                 : int  6689 6690 6691 6692 6693 6694 6695 6696 6697 6698 ...
 $ FSN                : num  52032 52033 52034 52035 52036 ...
 $ Neem_Tree          : num  1 1 1 1 1 1 1 1 1 0 ...
 $ Neem_Tree_Distance : num  12 30 38 38 12 16 20 25 45 -1 ...
 $ Neem_Tree_Size     : num  2 2 2 2 2 2 2 2 2 -1 ...
 $ Neem_Tree_Age      : num  2 2 2 2 2 2 2 2 2 -1 ...
 $ Neem_Tree_Usage    : chr  "188_" "188_" "188_" "188_" ...
 $ Neem_Tree_Use_Other: chr  "" "" "" "" ...
 $ Bamboo_Tree        : num  0 0 0 0 1 1 0 0 0 1 ...
 $ Bamboo_Tree_Dist   : num  -1 -1 -1 -1 5 15 -1 -1 -1 18 ...
 $ Banana_Tree        : num  1 1 0 0 1 1 1 1 1 1 ...
 $ Banana_Tree_Dist   : num  1 20 -1 -1 1 16 5 2 1 1 ...
 $ Rice_Field         : num  0 1 1 1 1 1 1 1 1 1 ...
 $ Rice_Field_Dist    : num  -1 1 1 1 10 15 18 12 1 2 ...
 $ Perm_Water_Body    : num  0 0 0 0 1 1 1 1 1 1 ...
 $ Perm_Wat_Body_Dist : num  -1 -1 -1 -1 2 1 8 12 26 17 ...
 $ Wat_Body_Mid_Point : num  -1 -1 -1 -1 1 1 1 1 1 1 ...
 $ No_Mosquito_Net    : num  1 1 1 0 1 4 4 5 1 0 ...
 $ Sprayed_2010       : num  86 86 86 86 86 86 86 86 86 86 ...
 $ Sprayed_2009       : num  86 86 86 86 86 86 86 86 86 86 ...
 $ Floor_Material     : num  153 153 153 153 153 153 155 155 153 153 ...
 $ Other_Floor_Mat    : num  -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 ...
 $ Is_Floor_Damp      : num  1 1 1 1 1 1 0 0 1 0 ...
 $ Roof_Material      : num  161 156 156 156 54 158 158 158 161 158 ...
 $ Other_Roof_Material: num  -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 ...
 $ Wall_Material      : num  164 162 162 162 164 165 165 165 164 165 ...
 $ Other_Wall_Material: num  -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 ...
 $ Windows_in_Room    : num  0 0 0 0 1 1 1 1 1 1 ...
 $ Granaries_in_HH    : num  1 1 1 1 1 1 1 1 1 1 ...
 $ Source_Drink_Water : num  92 167 167 92 167 167 167 167 167 167 ...
 $ Other_Src_Drink_Wat: num  -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 ...
 $ Toilet_Facility    : num  177 177 177 177 177 174 174 174 177 174 ...
 $ Other_Toilet_Fac   : num  -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 ...
 $ Cooking_Fuel       : num  180 180 180 180 180 180 180 180 180 180 ...
 $ Other_Cooking_Fuel : num  -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 ...
 $ Source_Light       : num  182 182 182 182 182 182 182 182 182 182 ...
 $ Other_Source_Light : num  -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 ...
 $ Electricity_in_HH  : num  0 0 0 0 0 1 1 1 0 0 ...
 $ No_Of_Rooms        : num  2 2 1 1 3 4 5 4 3 1 ...
 $ No_Sleeping_Rooms  : num  1 2 1 1 2 3 4 3 1 1 ...
Tip

In Q1_B we have 13377 obs

If we sum the observations in the subsets:

Top + bottom: 6688 + 6689 =13377

So, it is ok

Joining (Append)

Let’s join the two subsets (top and bottom)

merged_top_bottom  <-  rbind(top,bottom)  #To merge
str(merged_top_bottom)                    #To see the result
'data.frame':   13377 obs. of  40 variables:
 $ ID                 : int  1 2 3 4 5 6 7 8 9 10 ...
 $ FSN                : num  45001 45002 45003 45004 45005 ...
 $ Neem_Tree          : num  0 0 0 0 0 0 0 0 0 0 ...
 $ Neem_Tree_Distance : num  -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 ...
 $ Neem_Tree_Size     : num  -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 ...
 $ Neem_Tree_Age      : num  -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 ...
 $ Neem_Tree_Usage    : chr  "" "" "" "" ...
 $ Neem_Tree_Use_Other: chr  "" "" "" "" ...
 $ Bamboo_Tree        : num  1 1 1 1 1 1 1 1 1 1 ...
 $ Bamboo_Tree_Dist   : num  3 10 16 17 5 4 15 3 1 10 ...
 $ Banana_Tree        : num  0 0 0 0 0 0 0 1 1 0 ...
 $ Banana_Tree_Dist   : num  -1 -1 -1 -1 -1 -1 -1 5 1 -1 ...
 $ Rice_Field         : num  1 1 1 1 1 1 1 1 1 1 ...
 $ Rice_Field_Dist    : num  4 12 16 13 4 13 10 10 7 12 ...
 $ Perm_Water_Body    : num  0 0 0 0 0 0 0 0 0 0 ...
 $ Perm_Wat_Body_Dist : num  -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 ...
 $ Wat_Body_Mid_Point : num  -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 ...
 $ No_Mosquito_Net    : num  0 2 1 0 0 0 0 0 1 0 ...
 $ Sprayed_2010       : num  86 86 86 86 86 86 86 86 86 86 ...
 $ Sprayed_2009       : num  86 86 86 86 86 86 86 86 86 86 ...
 $ Floor_Material     : num  153 153 153 153 153 153 153 153 155 153 ...
 $ Other_Floor_Mat    : num  -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 ...
 $ Is_Floor_Damp      : num  1 1 1 1 1 1 1 1 0 1 ...
 $ Roof_Material      : num  161 159 158 156 156 156 156 158 158 156 ...
 $ Other_Roof_Material: num  -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 ...
 $ Wall_Material      : num  164 164 164 162 163 163 162 164 164 162 ...
 $ Other_Wall_Material: num  -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 ...
 $ Windows_in_Room    : num  1 1 1 0 0 0 0 1 1 0 ...
 $ Granaries_in_HH    : num  1 1 1 0 1 1 0 1 1 0 ...
 $ Source_Drink_Water : num  167 167 92 92 92 92 92 92 92 92 ...
 $ Other_Src_Drink_Wat: num  -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 ...
 $ Toilet_Facility    : num  177 177 177 177 177 177 177 177 177 177 ...
 $ Other_Toilet_Fac   : num  -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 ...
 $ Cooking_Fuel       : num  180 180 180 180 180 180 180 180 180 180 ...
 $ Other_Cooking_Fuel : num  -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 ...
 $ Source_Light       : num  182 182 182 182 182 182 182 182 182 182 ...
 $ Other_Source_Light : num  -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 ...
 $ Electricity_in_HH  : num  0 0 0 0 0 0 0 0 0 0 ...
 $ No_Of_Rooms        : num  2 2 3 1 2 3 2 1 3 1 ...
 $ No_Sleeping_Rooms  : num  2 2 2 1 2 2 2 1 1 1 ...
Tip

There are 13377 obs

It is the same of the original, So it is ok

Create a subset by coll

Seeing the names of the collums (name of the variables)

names(Q1_B)
 [1] "ID"                  "FSN"                 "Neem_Tree"          
 [4] "Neem_Tree_Distance"  "Neem_Tree_Size"      "Neem_Tree_Age"      
 [7] "Neem_Tree_Usage"     "Neem_Tree_Use_Other" "Bamboo_Tree"        
[10] "Bamboo_Tree_Dist"    "Banana_Tree"         "Banana_Tree_Dist"   
[13] "Rice_Field"          "Rice_Field_Dist"     "Perm_Water_Body"    
[16] "Perm_Wat_Body_Dist"  "Wat_Body_Mid_Point"  "No_Mosquito_Net"    
[19] "Sprayed_2010"        "Sprayed_2009"        "Floor_Material"     
[22] "Other_Floor_Mat"     "Is_Floor_Damp"       "Roof_Material"      
[25] "Other_Roof_Material" "Wall_Material"       "Other_Wall_Material"
[28] "Windows_in_Room"     "Granaries_in_HH"     "Source_Drink_Water" 
[31] "Other_Src_Drink_Wat" "Toilet_Facility"     "Other_Toilet_Fac"   
[34] "Cooking_Fuel"        "Other_Cooking_Fuel"  "Source_Light"       
[37] "Other_Source_Light"  "Electricity_in_HH"   "No_Of_Rooms"        
[40] "No_Sleeping_Rooms"  

Let’s divide the data in two parts taking as cut point the half of the columns

We have to be careful to keep an identification variable in each subset (in this case is FSN)

Left <- subset(Q1_B, select=c("ID", "FSN", "Neem_Tree", "Neem_Tree_Distance", "Neem_Tree_Size", "Neem_Tree_Age", "Neem_Tree_Usage", "Neem_Tree_Use_Other", "Bamboo_Tree", "Bamboo_Tree_Dist", "Banana_Tree", "Banana_Tree_Dist", "Rice_Field", "Rice_Field_Dist", "Perm_Water_Body", "Perm_Wat_Body_Dist",  "Wat_Body_Mid_Point", "No_Mosquito_Net", "Sprayed_2010", "Sprayed_2009", "Floor_Material"))
str(Left)
'data.frame':   13377 obs. of  21 variables:
 $ ID                 : int  1 2 3 4 5 6 7 8 9 10 ...
 $ FSN                : num  45001 45002 45003 45004 45005 ...
 $ Neem_Tree          : num  0 0 0 0 0 0 0 0 0 0 ...
 $ Neem_Tree_Distance : num  -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 ...
 $ Neem_Tree_Size     : num  -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 ...
 $ Neem_Tree_Age      : num  -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 ...
 $ Neem_Tree_Usage    : chr  "" "" "" "" ...
 $ Neem_Tree_Use_Other: chr  "" "" "" "" ...
 $ Bamboo_Tree        : num  1 1 1 1 1 1 1 1 1 1 ...
 $ Bamboo_Tree_Dist   : num  3 10 16 17 5 4 15 3 1 10 ...
 $ Banana_Tree        : num  0 0 0 0 0 0 0 1 1 0 ...
 $ Banana_Tree_Dist   : num  -1 -1 -1 -1 -1 -1 -1 5 1 -1 ...
 $ Rice_Field         : num  1 1 1 1 1 1 1 1 1 1 ...
 $ Rice_Field_Dist    : num  4 12 16 13 4 13 10 10 7 12 ...
 $ Perm_Water_Body    : num  0 0 0 0 0 0 0 0 0 0 ...
 $ Perm_Wat_Body_Dist : num  -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 ...
 $ Wat_Body_Mid_Point : num  -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 ...
 $ No_Mosquito_Net    : num  0 2 1 0 0 0 0 0 1 0 ...
 $ Sprayed_2010       : num  86 86 86 86 86 86 86 86 86 86 ...
 $ Sprayed_2009       : num  86 86 86 86 86 86 86 86 86 86 ...
 $ Floor_Material     : num  153 153 153 153 153 153 153 153 155 153 ...
Rigth <- subset(Q1_B, select=c("FSN", "Other_Floor_Mat", "Is_Floor_Damp", "Roof_Material", "Other_Roof_Material", "Wall_Material", "Other_Wall_Material", "Windows_in_Room", "Granaries_in_HH", "Source_Drink_Water", "Other_Src_Drink_Wat", "Toilet_Facility", "Other_Toilet_Fac", "Cooking_Fuel", "Other_Cooking_Fuel", "Source_Light", "Other_Source_Light", "Electricity_in_HH", "No_Of_Rooms", "No_Sleeping_Rooms"))

#Or select the opposite (With a "-" in front of the c):
#Rigth <- subset(Q1_B, select=-c("ID", "Neem_Tree", "Neem_Tree_Distance", "Neem_Tree_Size", "Neem_Tree_Age", "Neem_Tree_Usage", "Neem_Tree_Use_Other", "Bamboo_Tree", "Bamboo_Tree_Dist", "Banana_Tree", "Banana_Tree_Dist", "Rice_Field", "Rice_Field_Dist", "Perm_Water_Body", "Perm_Wat_Body_Dist",  "Wat_Body_Mid_Point", "No_Mosquito_Net", "Sprayed_2010", "Sprayed_2009", "Floor_Material"))

str(Rigth)
'data.frame':   13377 obs. of  20 variables:
 $ FSN                : num  45001 45002 45003 45004 45005 ...
 $ Other_Floor_Mat    : num  -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 ...
 $ Is_Floor_Damp      : num  1 1 1 1 1 1 1 1 0 1 ...
 $ Roof_Material      : num  161 159 158 156 156 156 156 158 158 156 ...
 $ Other_Roof_Material: num  -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 ...
 $ Wall_Material      : num  164 164 164 162 163 163 162 164 164 162 ...
 $ Other_Wall_Material: num  -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 ...
 $ Windows_in_Room    : num  1 1 1 0 0 0 0 1 1 0 ...
 $ Granaries_in_HH    : num  1 1 1 0 1 1 0 1 1 0 ...
 $ Source_Drink_Water : num  167 167 92 92 92 92 92 92 92 92 ...
 $ Other_Src_Drink_Wat: num  -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 ...
 $ Toilet_Facility    : num  177 177 177 177 177 177 177 177 177 177 ...
 $ Other_Toilet_Fac   : num  -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 ...
 $ Cooking_Fuel       : num  180 180 180 180 180 180 180 180 180 180 ...
 $ Other_Cooking_Fuel : num  -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 ...
 $ Source_Light       : num  182 182 182 182 182 182 182 182 182 182 ...
 $ Other_Source_Light : num  -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 ...
 $ Electricity_in_HH  : num  0 0 0 0 0 0 0 0 0 0 ...
 $ No_Of_Rooms        : num  2 2 3 1 2 3 2 1 3 1 ...
 $ No_Sleeping_Rooms  : num  2 2 2 1 2 2 2 1 1 1 ...

Joining (Merge)

Let’s join the two subsets (left and right) according to the FSN

merged_left_right  <-  merge(Left, Rigth, all=TRUE, by = "FSN")
str(merged_left_right)
'data.frame':   13377 obs. of  40 variables:
 $ FSN                : num  45001 45002 45003 45004 45005 ...
 $ ID                 : int  1 2 3 4 5 6 7 8 9 10 ...
 $ Neem_Tree          : num  0 0 0 0 0 0 0 0 0 0 ...
 $ Neem_Tree_Distance : num  -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 ...
 $ Neem_Tree_Size     : num  -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 ...
 $ Neem_Tree_Age      : num  -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 ...
 $ Neem_Tree_Usage    : chr  "" "" "" "" ...
 $ Neem_Tree_Use_Other: chr  "" "" "" "" ...
 $ Bamboo_Tree        : num  1 1 1 1 1 1 1 1 1 1 ...
 $ Bamboo_Tree_Dist   : num  3 10 16 17 5 4 15 3 1 10 ...
 $ Banana_Tree        : num  0 0 0 0 0 0 0 1 1 0 ...
 $ Banana_Tree_Dist   : num  -1 -1 -1 -1 -1 -1 -1 5 1 -1 ...
 $ Rice_Field         : num  1 1 1 1 1 1 1 1 1 1 ...
 $ Rice_Field_Dist    : num  4 12 16 13 4 13 10 10 7 12 ...
 $ Perm_Water_Body    : num  0 0 0 0 0 0 0 0 0 0 ...
 $ Perm_Wat_Body_Dist : num  -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 ...
 $ Wat_Body_Mid_Point : num  -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 ...
 $ No_Mosquito_Net    : num  0 2 1 0 0 0 0 0 1 0 ...
 $ Sprayed_2010       : num  86 86 86 86 86 86 86 86 86 86 ...
 $ Sprayed_2009       : num  86 86 86 86 86 86 86 86 86 86 ...
 $ Floor_Material     : num  153 153 153 153 153 153 153 153 155 153 ...
 $ Other_Floor_Mat    : num  -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 ...
 $ Is_Floor_Damp      : num  1 1 1 1 1 1 1 1 0 1 ...
 $ Roof_Material      : num  161 159 158 156 156 156 156 158 158 156 ...
 $ Other_Roof_Material: num  -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 ...
 $ Wall_Material      : num  164 164 164 162 163 163 162 164 164 162 ...
 $ Other_Wall_Material: num  -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 ...
 $ Windows_in_Room    : num  1 1 1 0 0 0 0 1 1 0 ...
 $ Granaries_in_HH    : num  1 1 1 0 1 1 0 1 1 0 ...
 $ Source_Drink_Water : num  167 167 92 92 92 92 92 92 92 92 ...
 $ Other_Src_Drink_Wat: num  -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 ...
 $ Toilet_Facility    : num  177 177 177 177 177 177 177 177 177 177 ...
 $ Other_Toilet_Fac   : num  -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 ...
 $ Cooking_Fuel       : num  180 180 180 180 180 180 180 180 180 180 ...
 $ Other_Cooking_Fuel : num  -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 ...
 $ Source_Light       : num  182 182 182 182 182 182 182 182 182 182 ...
 $ Other_Source_Light : num  -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 ...
 $ Electricity_in_HH  : num  0 0 0 0 0 0 0 0 0 0 ...
 $ No_Of_Rooms        : num  2 2 3 1 2 3 2 1 3 1 ...
 $ No_Sleeping_Rooms  : num  2 2 2 1 2 2 2 1 1 1 ...

Exercise 2

Let’s practice renaming variables and replacing values

Q1_B_1 <- read.csv("C:/Users/pined/OneDrive - Universidad Nacional Mayor de San Marcos/Javier 2022/Belgica/AC2_DataAnalysis_ThWk/Material/Q1_B_1.csv", sep=",", dec= ".")
str(Q1_B_1)
'data.frame':   9029 obs. of  6 variables:
 $ FSN              : int  45001 45001 45002 45003 45004 45006 45009 45021 45021 45024 ...
 $ Animal_Name      : chr  "Goats" "Poultry (chickens, ducks, pigeon)" "Buffaloes" "Cows/Oxen" ...
 $ Animal_Count     : int  3 1 1 1 2 2 7 3 1 3 ...
 $ Animal_Distance  : int  5 0 4 3 5 0 0 15 5 15 ...
 $ Keep_Inside_House: int  1 1 0 1 1 1 1 0 1 1 ...
 $ Days_Inside_House: int  120 365 -1 210 180 360 360 -1 150 90 ...
head(Q1_B_1)
    FSN                       Animal_Name Animal_Count Animal_Distance
1 45001                             Goats            3               5
2 45001 Poultry (chickens, ducks, pigeon)            1               0
3 45002                         Buffaloes            1               4
4 45003                         Cows/Oxen            1               3
5 45004                             Goats            2               5
6 45006                         Cows/Oxen            2               0
  Keep_Inside_House Days_Inside_House
1                 1               120
2                 1               365
3                 0                -1
4                 1               210
5                 1               180
6                 1               360
Tip

We observe:

  • There are 9029 obs

  • The animal name is quite large sometimes

  • The variable names are quite large

Renaming variables

Renaming variables labels: In collumns 2-5 to “anim”, “count”, “dist”, “indor”, “daysin”

names(Q1_B_1)[c(2,3,4,5,6)] <- c("anim", "count", "dist", "indor", "daysin")
str(Q1_B_1)
'data.frame':   9029 obs. of  6 variables:
 $ FSN   : int  45001 45001 45002 45003 45004 45006 45009 45021 45021 45024 ...
 $ anim  : chr  "Goats" "Poultry (chickens, ducks, pigeon)" "Buffaloes" "Cows/Oxen" ...
 $ count : int  3 1 1 1 2 2 7 3 1 3 ...
 $ dist  : int  5 0 4 3 5 0 0 15 5 15 ...
 $ indor : int  1 1 0 1 1 1 1 0 1 1 ...
 $ daysin: int  120 365 -1 210 180 360 360 -1 150 90 ...
#View(Q1_B_1)
table(Q1_B_1$count)

  -1    0    1    2    3    4    5    6    7    8    9   10   11   12   14   15 
   8   10 4104 3288  885  451  135   83   23   15    2    9    1    3    1    1 
  16   20   30   35   50   60 
   3    1    1    1    2    2 
table(Q1_B_1$dist)

  -1    0    1    2    3    4    5    6    7    8    9   10   11   12   13   14 
  24 3307  492  774  618  255  953  238  173  217   11  578   10   87    7    9 
  15   16   17   18   19   20   22   23   24   25   26   29   30   32   35   38 
 231    9    3   28    4  149   12    2    1   59    1    2   70    1   40    2 
  40   43   45   49   50   55   58   60   68   69   70   75   80   90   95  100 
  31    1   13    1  133   10    1   29    2    3   17   14    4    1    1  163 
 105  110  120  150  180  200  210  250  300  400  500  600 1000 
   3    7    3   60    2   77    2    4   19    9   40    9    3 
table(Q1_B_1$indor)

   0    1   89 
4816 4200   13 
table(Q1_B_1$daysin)

  -1    0    1    3    5    7    8   10   12   15   18   20   30   35   40   45 
4823    1    2    1    4    2    2    8    1    1    1    3   48    2    2    7 
  50   60   65   70   80   90  100  110  120  122  130  150  160  165  175  180 
   2   84    1    1    5  192   23    2  273    1    2   74   13    5    2  466 
 190  200  210  215  220  230  240  245  250  260  265  270  280  300  310  340 
   1   57   11    1    1    4   82    1   10    7    3    3   33   36    1    1 
 350  355  356  360  365 
   1    1    1   21 2699 

Replacing values in observations

table(Q1_B_1$anim)

                        Buffaloes                         Cows/Oxen 
                             1592                              3055 
                             Dogs                             Goats 
                               38                              3816 
                            Other                              Pigs 
                                3                                20 
Poultry (chickens, ducks, pigeon) 
                              505 
Q1_B_1$anim[Q1_B_1$anim =="Buffaloes"] <- "Buff"
Q1_B_1$anim[Q1_B_1$anim =="Goats"] <- "Goat"

#But if we want to keep the 4 (x) first characters in a all rows
Q1_B_1$anim <- substr(Q1_B_1$anim, 0, 3)
table(Q1_B_1$anim)

 Buf  Cow  Dog  Goa  Oth  Pig  Pou 
1592 3055   38 3816    3   20  505 
str(Q1_B_1)
'data.frame':   9029 obs. of  6 variables:
 $ FSN   : int  45001 45001 45002 45003 45004 45006 45009 45021 45021 45024 ...
 $ anim  : chr  "Goa" "Pou" "Buf" "Cow" ...
 $ count : int  3 1 1 1 2 2 7 3 1 3 ...
 $ dist  : int  5 0 4 3 5 0 0 15 5 15 ...
 $ indor : int  1 1 0 1 1 1 1 0 1 1 ...
 $ daysin: int  120 365 -1 210 180 360 360 -1 150 90 ...

We are going to save this data because we are going to use it later

saveRDS(Q1_B_1, "Q1_B_1.RDS")